Abstract

We introduce the anti-profile Support Vector Machine (apSVM) as a novel algorithm to address the
anomaly classification problem, an extension of anomaly detection where the goal is to distinguish
data samples from a number of anomalous and heterogeneous classes based on their pattern of
deviation from a normal stable class. We show that, under heterogeneity assumptions defined here,
the apSVM can be solved as the dual of a standard SVM with an indirect kernel that measures
similarity of anomalous samples through similarity to the stable normal class. We characterize this
indirect kernel as the inner product in a Reproducing Kernel Hilbert Space between representers
that are projected to the subspace spanned by the representers of the normal samples. We show by
simulation and application to cancer genomics datasets that the anti-profile SVM produces
classifiers that are more accurate and stable than the standard SVM in the anomaly classification
setting.



1 Introduction 



The task of anomaly, or outlier, detection [12, 8, 1] is to identify data samples that deviate
significantly from a class for which training samples are available. We explore anomaly
classification as an extension to this setting, where the goal is to distinguish data samples from
a number of anomalous and heterogeneous classes based on their pattern of deviation from a normal
stable class. Specifically, presented with samples from a normal class, along with samples from two
or more anomalous classes, we want to train a classifier to distinguish samples from the anomalous
classes. Since the anomalous classes are heterogeneous, using deviation from the normal class as
the basis of classification, instead of building a classifier for the anomalous classes that
ignores samples from the normal class, may lead to classifiers and results that are more stable and
reproducible.

The motivation for exploring this learning setting is from recent results in cancer genomics. In
particular, it was shown that hyper-variability in certain genomic measurements (DNA methylation
and gene expression) in specific regions is a stable cancer mark across many tumor types.
Furthermore, this hyper-variability increases during stages of cancer progression. This led us to
the question of how to distinguish samples from different stages in the presence of
hyper-variability. In essence, we ask how to distinguish samples from different anomalous classes
(given by cancer progression stage) based on deviation from a well-defined normal class
(measurements from non-cancerous samples).

Figure 1: (A) Principal component analysis of DNA methylation data [3]. Variability in methylation
measurements increases from normal to benign lesions (adenoma) to malignant lesions (cancer). The
heterogeneity of the adenoma and cancer anomalous samples is the defining feature of the anomaly
classification setting. (B) and (C) Illustration of the heterogeneity assumption in Definition 1.
For both a linear and radial basis function kernel, the magnitude of eigenvalues of the kernel
matrix is larger for the anomalous classes.

We introduce the anti-profile Support Vector Machine (apSVM) as a novel algorithm suitable for 
the anomaly classification task. It is based on the idea of only using the stable normal class to 
define basis functions over which the classifier is defined. We show that the dual of the apSVM 
optimization problem is the same as the dual of the standard SVM with a modified kernel function. 
We then show that this modified kernel function has general properties that ensure better stability 
than the standard SVM in the anomaly classification task. 

The paper is organized as follows: we first present the anomaly classification setting in detail; we 
next describe the Anti-Profile Support Vector Machine (apSVM), and show that the dual of the opti- 
mization problem defined by it is equivalent to the dual of the standard SVM with a specific kernel 
modification; we next show that this kernel modification leads directly to a theoretical statement of 
the stability of the apSVM compared to the standard SVM in the anomaly classification setting; we 
next show simulation results describing the performance and stability of the apSVM; and finally, we 
present results from cancer genomics showing the benefit of approaching classification problems in 
this area from the anomaly classification point of view. 



2 The anomaly classification problem 

We present the anomaly classification problem in the binary case, with two anomalous classes.
Assume we are given training samples in $\mathbb{R}^p$ from three classes: $m$ datapoints from
normal class $Z$, and $n$ training datapoints as pairs $(x_1, y_1), \ldots, (x_n, y_n)$ with labels
$y_i \in \{-1, 1\}$ indicating membership of $x_i$ in one of two anomalous classes $A^-$ and $A^+$.
Furthermore, we assume that the anomalous classes are heterogeneous with respect to normal class
$Z$. Figure 1a illustrates this learning setting for DNA methylation data [3] (see [5] for details
on this aspect of cancer epigenetics). It is a two-dimensional embedding (using PCA) of DNA
methylation data for normal colon tissues along with benign growths (adenomas) and cancerous
growths (tumors). Variability in these specific measurements increases from normal to adenoma to
tumor. We would like to build stable and robust classifiers that distinguish benign growths from
tumors.

Next we seek to formalize the heterogeneity assumption of the anomaly classification problem.
Intuitively, the heterogeneity assumption we make is that given random samples of the same size
from the normal class and from the anomalous classes, in expectation, the sample covariance of
the anomalous samples is always larger than the covariance of the normal samples. We state our
assumption in the case of Reproducing Kernel Hilbert Spaces (RKHS) since we will use this
machinery throughout the paper [10, 14]. Recall that in a Bayesian interpretation of this setting,
the kernel function associated with an RKHS serves as the covariance of a Gaussian point process.

Definition 1 (Heterogeneity Assumption). Let $\mathcal{H}$ be a Reproducing Kernel Hilbert Space
with associated kernel function $k$. Let $K_n^m$ and $K_a^m$ be the kernel matrices resulting from
evaluating the kernel function for a sample of size $m$ of points in the normal and anomalous
classes respectively. The heterogeneity assumption is that for every integer $m$, there exists
$\epsilon \in \mathbb{R}$, where $0 < \epsilon < 1$, such

Figures 1b and 1c show that the heterogeneity assumption is satisfied in the DNA methylation data 
for both linear and radial basis function kernels. Each figure shows the magnitude of the eigenvalues 
of the resulting kernel matrices. The magnitude of the eigenvalues in both cases is larger for the 
anomalous classes. 
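
The eigenvalue comparison described above can be sketched numerically. The following is a minimal illustration on synthetic Gaussian data, not on the methylation data; the sample size, the number of features, and the standard deviations (1 for the normal group, 4 for the anomalous group) are assumptions chosen only to make the effect visible, and only the linear kernel is shown.

```python
import numpy as np

def linear_kernel(A, B):
    """Linear kernel matrix: entry (i, j) is the inner product of A[i] and B[j]."""
    return A @ B.T

def sorted_eigenvalues(K):
    """Eigenvalues of a symmetric kernel matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(K))[::-1]

rng = np.random.default_rng(0)
m, p = 20, 50  # hypothetical sample size and number of features

# Synthetic stand-ins for the two groups (assumed, not the data from the paper).
normal = rng.normal(0.0, 1.0, size=(m, p))     # stable class, standard deviation 1
anomalous = rng.normal(0.0, 4.0, size=(m, p))  # heterogeneous class, standard deviation 4

eig_normal = sorted_eigenvalues(linear_kernel(normal, normal))
eig_anomalous = sorted_eigenvalues(linear_kernel(anomalous, anomalous))

print("largest eigenvalue, normal:   ", eig_normal[0])
print("largest eigenvalue, anomalous:", eig_anomalous[0])
print("sum of eigenvalues, normal:   ", eig_normal.sum())
print("sum of eigenvalues, anomalous:", eig_anomalous.sum())
```

For a linear kernel the sum of the eigenvalues equals the total squared norm of the samples, so a more variable class necessarily produces a larger spectrum in this sense.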

The heterogeneity assumption gives us a hint to construct classifiers that deal with the heterogeneity 
of the anomalous classes. In Section 3.4 we show that heterogeneity has an impact on robustness and 
stability of classifiers built from training sets of the anomalous classes. Our goal is to use samples 
from the stable normal class to create classifiers that are robust. We describe the anti-profile SVM 
as an extension to Support Vector Machines that accomplishes this goal. 



3 The anti-profile SVM 



Support Vector Machines (SVMs) are one of the primary machine learning tools used for
classification. SVMs operate by learning the maximum-margin separator between two groups of data
provided at training time. Any new observation provided to the SVM is classified by determining
which side of the separator it lies on. An important advantage of SVMs is that, by applying the
kernel trick, it is possible to find a hyperplane in a higher dimensional space where the two given
classes are linearly separable, even when they are not linearly separable in their original feature
space, and this computation can be performed at no significant cost. While primarily designed for
binary classification, SVMs have been extended to many other problems, such as multi-class
classification and function estimation.



3.1 The SVM as weighted voting of basis functions 

Here we review SVMs from a function approximation perspective [14]: consider a set of $n$
observations, each observation being drawn from $\mathcal{X} \times \mathcal{Y}$, where
$\mathcal{X} \subseteq \mathbb{R}^p$ and $\mathcal{Y} \subseteq \{-1, 1\}$. Here $p$ is the number
of features in each observation, or the dimensionality of the feature space. Thus each observation
consists of a pair $(x_i, y_i)$, $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$, for
$i = 1, \ldots, n$; here $y_i$ indicates which of the two classes the observation belongs to. If we
introduce a new observation $x'$ which needs to be classified, then the classification problem
amounts to comparing $x'$ to the existing set of points and combining the comparisons to make a
decision.

To make the comparisons between observations, we make use of a similarity function. Let
$k(x_i, x_j)$ be a positive-definite similarity function which compares two points
$x_i, x_j \in \mathbb{R}^p$. Weighing the similarity of the new observation to each existing
observation, the difference of the sum of weighted similarities for the two groups will provide the
necessary classification: $g(x) = d + \sum_{i=1}^{n} c_i y_i k(x, x_i)$. Here $c_i > 0 \;\forall i$
is the weight associated with each point, and $d$ is a bias term. Classification is then based on
the sign of the expansion: $f(x) = \mathrm{sgn}[g(x)]$.
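
The weighted-voting view can be sketched in a few lines. This is a minimal illustration, assuming the expansion reads g(x) = d + sum_i c_i y_i k(x, x_i) with nonnegative weights; the RBF similarity, the toy points, the uniform weights, and the zero bias are all hypothetical choices and not a fitted SVM solution.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Positive-definite similarity k(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * diff @ diff))

def decision_value(x_new, X_train, y_train, c, d, kernel=rbf):
    """g(x) = d + sum_i c_i * y_i * k(x, x_i), assuming weights c_i >= 0."""
    votes = [ci * yi * kernel(x_new, xi) for xi, yi, ci in zip(X_train, y_train, c)]
    return d + sum(votes)

def classify(x_new, X_train, y_train, c, d, kernel=rbf):
    """f(x) = sgn(g(x)); a value of exactly zero is assigned to +1 by convention here."""
    return 1 if decision_value(x_new, X_train, y_train, c, d, kernel) >= 0 else -1

# Toy training set: two points per class (hypothetical data).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.8]])
y_train = np.array([-1, -1, 1, 1])
c = np.ones(4)  # hypothetical uniform weights
d = 0.0         # hypothetical bias term

label_near_negative = classify([0.1, 0.0], X_train, y_train, c, d)
label_near_positive = classify([2.9, 3.0], X_train, y_train, c, d)
print(label_near_negative, label_near_positive)
```

Each training point casts a vote for its own class, scaled by its weight and by its similarity to the new observation; the sign of the total decides the label.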

Usually in SVMs function $k$ is further assumed to have the reproducing property in a Reproducing
Kernel Hilbert Space $\mathcal{H}$ associated with $k$:
$\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ for all $f \in \mathcal{H}$, and in
particular $\langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}} = k(x, y)$. In this case, the
basis functions in the classifier correspond to representers $k(x, \cdot)$. In the standard SVM,
the representers of all training points are potentially used as basis functions in the classifier,
but effectively only a small number of representers are used as basis functions, namely the Support
Vectors. However, for a given problem, we may choose a different set of points for the derivation
of the set of basis functions; the basis functions determine how the similarities are measured for
a new point.

3.2 The Anti-Profile SVM optimization problem 

The core idea in the anti-profile SVM (apSVM) is to make use of this characterization of the Support 
Vector Machine as a linear expansion of basis functions defined by representers of training samples. 
In order to address the heterogeneity assumption underlying the anomaly classification problem we 
define basis functions only using samples from the stable normal class. 

Formally, we restrict the set of available functions to the subspace of $\mathcal{H}$ spanned by
the representers of samples $z_1, \ldots, z_m$ from normal class $Z$:
$f(x) = d + \sum_{i=1}^{m} c_i k(z_i, x)$. To estimate coefficients $c_i$ in the basis expansion we
apply the usual regularized risk functional based on hinge loss

$$\frac{1}{n} \sum_{i=1}^{n} \left(1 - y_i f(x_i)\right)_+ + \frac{\lambda}{2} \|h\|_{\mathcal{H}}^2,$$

where $(\cdot)_+ = \max(0, \cdot)$, $f(x)$ is defined as $f(x) = d + h(x)$, and $\lambda > 0$ is a
regularization parameter. By the reproducing kernel property, we have in this case that
$\|h\|_{\mathcal{H}}^2 = c^T K_n c$, where $K_n$ is the kernel matrix defined on the $m$ normal
samples.

The minimizer of the empirical risk functional is given by the solution of a quadratic optimization
problem, similar to the standard SVM, but with two kernel matrices used: $K_n$, defined in the
previous paragraph, and $K_s$, which contains the evaluation of kernel function $k$ between
anomalous samples $x_1, \ldots, x_n$ and normal samples $z_1, \ldots, z_m$.

$$\min_{c, d, \xi} \; e^T \xi + \frac{n\lambda}{2} c^T K_n c \tag{1}$$
$$\text{s.t. } Y(K_s c + d e) + \xi \ge e, \quad \xi \ge 0$$

Here we use slack variables $\xi = (\xi_1, \ldots, \xi_n)^T$, denote the vector of ones of size $n$
as $e$, and define matrix $Y$ as the diagonal matrix such that $Y_{ii} = y_i$.

3.3 Solving the apSVM optimization problem 

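The Lagrangian of problem (1) is given by

$$L(c, d, \xi, \alpha, \beta) = e^T \xi + \frac{n\lambda}{2} c^T K_n c - \alpha^T \left[ Y(K_s c + d e) + \xi - e \right] - \beta^T \xi,$$

where $\alpha_{n \times 1} = (\alpha_1, \ldots, \alpha_n)^T$ and
$\beta_{n \times 1} = (\beta_1, \ldots, \beta_n)^T$ are the Lagrangian multipliers. Minimizing with
respect to $\xi$, $c$ and $d$, we find that the Wolfe dual of problem (1) is

$$\max_{\alpha} \; e^T \alpha - \frac{1}{2n\lambda} \alpha^T Y \tilde{K} Y \alpha \tag{2}$$
$$\text{s.t. } 0 \le \alpha \le e, \quad e^T Y \alpha = 0$$

where $\tilde{K} = K_s K_n^{-1} K_s^T$. Here we assume $K_n^{-1}$ represents a pseudo-inverse in
the case where $K_n$ is not positive definite.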
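
The step from the Lagrangian to the dual can be written out explicitly. The following is a sketch of the stationarity conditions, assuming the quadratic term of the primal is scaled by $\frac{n\lambda}{2}$ and that $K_n$ is invertible (otherwise read $K_n^{-1}$ as a pseudo-inverse):

$$
\begin{aligned}
\frac{\partial L}{\partial c} = 0 \;&\Rightarrow\; n\lambda K_n c = K_s^T Y \alpha \;\Rightarrow\; c = \frac{1}{n\lambda} K_n^{-1} K_s^T Y \alpha, \\
\frac{\partial L}{\partial d} = 0 \;&\Rightarrow\; e^T Y \alpha = 0, \\
\frac{\partial L}{\partial \xi} = 0 \;&\Rightarrow\; e - \alpha - \beta = 0 \;\Rightarrow\; 0 \le \alpha \le e \quad (\text{since } \alpha, \beta \ge 0).
\end{aligned}
$$

Substituting these back, the terms in $\xi$ and $d$ vanish, the quadratic term becomes $\frac{1}{2n\lambda} \alpha^T Y K_s K_n^{-1} K_s^T Y \alpha$, and the cross term $\alpha^T Y K_s c$ becomes $\frac{1}{n\lambda} \alpha^T Y K_s K_n^{-1} K_s^T Y \alpha$, which leaves

$$e^T \alpha - \frac{1}{2n\lambda} \alpha^T Y K_s K_n^{-1} K_s^T Y \alpha.$$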

For a standard SVM, the objective of the Wolfe dual is
$e^T \alpha - \frac{1}{2n\lambda} \alpha^T Y K Y \alpha$, with $K$ the kernel matrix of the
training datapoints. Thus the dual problem of the apSVM has the same form as the standard SVM dual
problem with the exception that kernel matrix $K$ is replaced by induced kernel matrix $\tilde{K}$
in the apSVM. Kernel matrix $\tilde{K}$ essentially represents an indirect kernel between anomalous
samples induced by the set of basis functions determined by the samples from the normal class.
Since the essential form of the SVM solution is unchanged by the modification, this provides the
additional advantage that the modified SVM can be solved by the same tools that solve a regular
SVM, but with a different kernel matrix provided. For our particular problem domain, we use the
indirect kernel to represent deviation from the profile of normal samples, and thus refer to this
classifier as the anti-profile SVM.
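
Computing the indirect kernel matrix takes only a few lines of linear algebra. The sketch below assumes an RBF base kernel and synthetic data; the function names `rbf_kernel` and `indirect_kernel`, the value of `gamma`, and the sample sizes are hypothetical and chosen for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def indirect_kernel(X_left, X_right, Z, gamma):
    """K(X_left, Z) @ pinv(K(Z, Z)) @ K(X_right, Z)^T, using normal samples Z as the basis."""
    K_n_pinv = np.linalg.pinv(rbf_kernel(Z, Z, gamma))  # pseudo-inverse of the normal kernel matrix
    return rbf_kernel(X_left, Z, gamma) @ K_n_pinv @ rbf_kernel(X_right, Z, gamma).T

rng = np.random.default_rng(1)
Z = rng.normal(0.0, 1.0, size=(20, 5))        # normal samples (synthetic)
X_train = rng.normal(0.0, 3.0, size=(40, 5))  # anomalous training samples (synthetic)
X_test = rng.normal(0.0, 3.0, size=(10, 5))   # anomalous test samples (synthetic)
gamma = 0.1                                   # hypothetical kernel width

K_train = indirect_kernel(X_train, X_train, Z, gamma)  # n x n matrix for fitting
K_test = indirect_kernel(X_test, X_train, Z, gamma)    # n_test x n matrix for prediction
print(K_train.shape, K_test.shape)
```

Matrices of this shape can be handed to any SVM solver that accepts a precomputed kernel; for example, scikit-learn's `SVC(kernel="precomputed")` takes the training matrix in `fit` and the test-by-train matrix in `predict`. That solver is one possible choice and is not the one used for the experiments reported here.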

3.4 Characterizing the indirect kernel 

We saw above that the apSVM can be solved as a standard SVM with induced kernel
$\tilde{K} = K_s K_n^{-1} K_s^T$. In this section we characterize this indirect kernel, and state
a general result that elucidates how the apSVM can produce classifiers that are more robust and
reproducible than a standard SVM in this setting.

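Proposition 1. Let $P_Z$ be the linear operator that projects representers
$k(x, \cdot) \in \mathcal{H}$ to the space spanned by the representers of the $m$ normal samples of
the anomaly classification problem. Induced kernel $\tilde{k}$ satisfies
$\tilde{k}(x, y) = \langle P_Z k(x, \cdot), P_Z k(y, \cdot) \rangle_{\mathcal{H}}$.

Proof. Projection $P_Z k(x, \cdot)$ is defined as
$P_Z k(x, \cdot) = \sum_i \beta_i k(z_i, \cdot)$ where

$$
\begin{aligned}
\beta &= \arg\min_{\beta} \frac{1}{2} \Big\| k(x, \cdot) - \sum_i \beta_i k(z_i, \cdot) \Big\|_{\mathcal{H}}^2 \\
&= \arg\min_{\beta} \frac{1}{2} \Big\langle k(x, \cdot) - \sum_i \beta_i k(z_i, \cdot),\; k(x, \cdot) - \sum_i \beta_i k(z_i, \cdot) \Big\rangle_{\mathcal{H}} \\
&= \arg\min_{\beta} \frac{1}{2} \sum_i \sum_j \beta_i \beta_j \langle k(z_i, \cdot), k(z_j, \cdot) \rangle_{\mathcal{H}} - \sum_i \beta_i \langle k(x, \cdot), k(z_i, \cdot) \rangle_{\mathcal{H}} \\
&= \arg\min_{\beta} \frac{1}{2} \beta^T K_n \beta - k_{zx}^T \beta,
\end{aligned}
\tag{3}
$$

where $k_{zx}$ is the vector with element $i$ equal to $k(z_i, x)$. From (3) we get
$\beta = K_n^{-1} k_{zx}$. Therefore
$\langle P_Z k(x, \cdot), P_Z k(y, \cdot) \rangle_{\mathcal{H}} = k_{zx}^T K_n^{-1} k_{zy} = \tilde{k}(x, y)$. □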

This proposition states that the indirect kernel is the inner product in Reproducing Kernel Hilbert 
Space between the representers of anomalous samples projected to the space spanned by the 
representers of normal samples. By the heterogeneity assumption of Definition 1, the space spanned 
by any subset of anomalous samples will be smaller after the projection. In particular, the smallest 
sphere enclosing the projected representers will be smaller, and from results such as the Vapnik- 
Chapelle support vector span rule ifTSl, classifiers built from this projection will be more robust and 
stable. 
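
Proposition 1 can be checked numerically. The sketch below assumes an RBF base kernel and synthetic points (both hypothetical choices): it computes the projection coefficients as the solution of $K_n \beta = k_{zx}$, compares the inner product of the projected representers with the indirect kernel value, and confirms that the projection does not increase the squared norm of a representer.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(2)
gamma = 0.2                              # hypothetical kernel width
Z = rng.normal(0.0, 1.0, size=(15, 3))   # normal samples z_1, ..., z_m (synthetic)
x = rng.normal(0.0, 3.0, size=(1, 3))    # first anomalous point (synthetic)
y = rng.normal(0.0, 3.0, size=(1, 3))    # second anomalous point (synthetic)

K_n = rbf_kernel(Z, Z, gamma)
k_zx = rbf_kernel(Z, x, gamma)[:, 0]     # vector with entries k(z_i, x)
k_zy = rbf_kernel(Z, y, gamma)[:, 0]     # vector with entries k(z_i, y)

# Projection coefficients: beta solves K_n beta = k_zx (and likewise for y).
beta_x = np.linalg.solve(K_n, k_zx)
beta_y = np.linalg.solve(K_n, k_zy)

# Inner product of the projected representers: beta_x^T K_n beta_y.
inner_projected = beta_x @ K_n @ beta_y

# Indirect kernel value: k_zx^T K_n^{-1} k_zy.
k_tilde_xy = k_zx @ np.linalg.solve(K_n, k_zy)

# Squared norm after projection; for the RBF kernel k(x, x) = 1 before projection.
k_tilde_xx = k_zx @ beta_x

print(inner_projected, k_tilde_xy, k_tilde_xx)
```

The last quantity is the squared norm of the projected representer, which cannot exceed the squared norm of the original one; this is the sense in which the projected representers fit in a smaller enclosing sphere.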



4 Simulation Study 

We first present simulation results that show that the apSVM obtains better accuracy in the anomaly
classification setting while providing stable and robust classifiers. We generated samples from
three normal distributions as follows: if $A^+$ and $A^-$ are the anomalous classes that we need to
distinguish, and $Z$ is the normal class, then for a given feature we draw datapoints from
distributions $Z = N(0, \sigma_Z^2)$, $A^- = N(0, \sigma_{A^-}^2)$ and
$A^+ = N(0, \sigma_{A^+}^2)$. To simulate our problem setting, we set

Results have been obtained from tests written in R (version 2.14) with R packages kernlab (version
0.9-14) [7] and svmpath (version 0.952). The svmpath tool fits the entire regularization path of
the SVM solution at little additional computational cost [4]. Using the resulting fit, the SVM
classifications for any given cost parameter can be obtained. For our experiments, the testing set
accuracy was computed for each value of cost along the regularization path, and the best accuracy
was recorded; ties were broken by choosing the option with the fewest support vectors. Note that a
small ridge parameter (1e-10) was used in the svmpath method to avoid singularities in
computations.

Figure 2: (A) Accuracy results in simulated anomaly classification data. The anti-profile SVM 
achieves better accuracy than the standard SVM. (B) Stability results in simulated data. The anti- 
profile SVM uses a smaller proportion of training points as support vectors. SVMs that use fewer 
support vectors are more robust and stable. 



Each training set contained 20 samples from each of the $A^-$ and $A^+$ classes, while each testing set 
contained 5 samples from each class; 20 samples from class Z were used for the anti-profile SVM. 
For a given number of features, each test was run 10 times and the mean accuracy computed. To 
estimate the hyperparameter for the radial basis kernel, the inverse of the mean distance between 5 
normal and 5 anomalous samples (chosen randomly) was used. 

Figure 2a shows the accuracy of a standard SVM and the apSVM using an RBF kernel for simulated
data with $\sigma_Z = 1$, $\sigma_{A^-} = 2$, $\sigma_{A^+} = 4$. With a radial basis kernel, the
anti-profile SVM was able to achieve better classification than the regular SVM.
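
The simulated data and the kernel-width heuristic can be sketched as follows. This is an illustration under stated assumptions rather than the R code used for the experiments: the three sigma values are treated as standard deviations, the heuristic is read as the inverse of the mean pairwise Euclidean distance between 5 random normal and 5 random anomalous samples, and the function names `simulate` and `rbf_gamma_heuristic` are hypothetical.

```python
import numpy as np

def simulate(n_features, n_per_class=20, sigma_z=1.0, sigma_minus=2.0, sigma_plus=4.0, seed=0):
    """Draw zero-mean Gaussian samples for the normal class Z and anomalous classes A- and A+."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(0.0, sigma_z, size=(n_per_class, n_features))
    A_minus = rng.normal(0.0, sigma_minus, size=(n_per_class, n_features))
    A_plus = rng.normal(0.0, sigma_plus, size=(n_per_class, n_features))
    X = np.vstack([A_minus, A_plus])  # anomalous samples only
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return Z, X, y

def rbf_gamma_heuristic(Z, X, n_pick=5, seed=0):
    """Inverse of the mean distance between n_pick normal and n_pick anomalous samples."""
    rng = np.random.default_rng(seed)
    z = Z[rng.choice(len(Z), size=n_pick, replace=False)]
    x = X[rng.choice(len(X), size=n_pick, replace=False)]
    # All n_pick x n_pick pairwise Euclidean distances (an assumed reading of the heuristic).
    dists = np.linalg.norm(z[:, None, :] - x[None, :, :], axis=2)
    return 1.0 / dists.mean()

Z, X, y = simulate(n_features=100)
gamma = rbf_gamma_heuristic(Z, X)
print(Z.shape, X.shape, gamma)
```

The returned `gamma` would play the role of the RBF hyperparameter, and `Z` would only be used to build the indirect kernel, never as a labeled class.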

We characterize the stability of a classifier using the proportion of training samples that are selected 
as support vectors. Classifiers that use a small proportion of points as support vectors are more robust 
and stable to stochasticity in the sampling process. The more support vectors used by an SVM, the 
more likely it is that the classification boundary will change with any changes in the training data. 
Hence a boundary that is defined by only a few support vectors will result in a more robust, reliable 
SVM. Figure 2b shows that in the simulation study the apSVM used fewer support vectors than the 
standard SVM while obtaining better accuracy. 
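
The stability measure used here is simple to compute from a fitted dual solution. In the sketch below the dual coefficient vectors are made-up numbers for illustration only, and the tolerance used to decide whether a coefficient is nonzero is an assumption.

```python
import numpy as np

def support_vector_fraction(alpha, tol=1e-8):
    """Proportion of training points whose dual coefficient is nonzero (above tol)."""
    alpha = np.asarray(alpha, dtype=float)
    return float(np.mean(alpha > tol))

# Hypothetical dual coefficients for eight training points under two classifiers.
alpha_standard = np.array([0.0, 0.7, 1.0, 0.3, 0.0, 1.0, 0.2, 0.0])
alpha_anti_profile = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.4, 0.0, 0.0])

frac_standard = support_vector_fraction(alpha_standard)          # 5 of 8 points
frac_anti_profile = support_vector_fraction(alpha_anti_profile)  # 2 of 8 points
print(frac_standard, frac_anti_profile)
```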



5 Application to cancer genomics 



The motivation for this work is from recent studies of epigenetic mechanisms of cancer. Epigenetics
is the study of mechanisms by which the expression level of a gene (i.e. the degree to which a
gene can exert its influence) can be modified without any change in the underlying DNA sequence.
Recent results show that certain changes in DNA methylation are closely associated with the
occurrence of multiple cancer types [3]. In particular, the existence of highly-variable
DNA-methylated regions in cancer as compared to normals (i.e. healthy tissue) has been shown.
Furthermore, these highly-variable regions are associated with tissue differentiation, and are
present across multiple cancer types. Another important observation made there is that adenomas,
which are benign tumors, show intermediate levels of hyper-variability in the same DNA-methylated
regions as compared to cancer and normals.

This presents an interesting machine learning problem: how can we distinguish between cancer and
adenoma based on the hyper-variability of their methylation levels with respect to normal samples?
A successful tool that can classify between the two groups can have far-reaching benefits in the
area of personalized medicine and diagnostics. Since the two classes are essentially differentiated
by the degree of variability they exhibit with respect to normals, for our purpose we can abstract
the problem to the setting we present here as anomaly classification.

Figure 3: Classification results for DNA methylation in cancer [3]. While both a standard SVM and
the anti-profile SVM achieve similar accuracy using an RBF kernel (A), the anti-profile SVM uses
far fewer support vectors.



5.1 Methylation data results 



We study the performance of the apSVM in a dataset of DNA methylation measurements obtained for
colon tissue from 25 healthy samples, 19 adenoma samples, and 16 cancer samples, for 384 specific
positions in the human genome [3]. As mentioned previously, the cancer samples exhibit higher
variance than healthy samples, with adenoma samples showing an intermediate level of variability
(Figure 1). We used the same classification methods mentioned in the previous section, but with
multiple runs, for each run randomly choosing 80% of tumor samples for training and the remaining
samples for testing. Figure 3 shows the results obtained using a radial basis kernel. While the
indirect kernel performs either at the same level or marginally better than the regular kernel, the
anti-profile SVM uses far fewer support vectors than the standard SVM, thus providing a much more
robust classifier.



5.2 Expression data results 



We further applied our method to gene expression data obtained from a clinical experiment on
adrenocortical cancer [2]. The data contains expression levels for 54675 probesets, for 10 healthy
samples, 22 adenoma samples, and 32 cancer samples. The data shows the same pattern with regard
to hyper-variability as the methylation data. Using the same methods as before, the results
obtained using a linear kernel are shown in Figure 4. For feature selection, the features were
ranked according to a log variance ratio with the carcinoma variance in the numerator; given a
number $n$ of features to be used, the $n$ features with the highest variance ratio were selected.
While both the standard SVM and the apSVM provided almost perfect classification, there is a
significant difference in the number of support vectors used, with the indirect kernel requiring
far fewer support vectors and hence providing a more stable classifier.
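
Ranking features by a log variance ratio can be sketched as below. The denominator of the ratio is not recoverable from the text, so using the adenoma variance there is an assumption made only for this illustration; the data are synthetic and the function name `top_variance_ratio_features` is hypothetical.

```python
import numpy as np

def top_variance_ratio_features(X_num, X_den, n_features):
    """Return indices of the n_features columns with the largest log variance ratio.

    X_num and X_den are (samples, features) arrays for the numerator and
    denominator groups of the ratio.
    """
    log_ratio = np.log(X_num.var(axis=0, ddof=1) / X_den.var(axis=0, ddof=1))
    return np.argsort(log_ratio)[::-1][:n_features]

rng = np.random.default_rng(3)
carcinoma = rng.normal(0.0, 1.0, size=(32, 200))  # synthetic stand-in for 32 cancer samples
adenoma = rng.normal(0.0, 1.0, size=(22, 200))    # synthetic stand-in for 22 adenoma samples
carcinoma[:, :10] *= 10.0                         # make the first 10 features hyper-variable

# Assumed denominator: adenoma variance (not stated in the recoverable text).
selected = top_variance_ratio_features(carcinoma, adenoma, n_features=10)
print(sorted(selected.tolist()))
```

On real expression data the same function would be applied once per training split, so that the test samples do not influence which features are kept.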



6 Discussion 



We have introduced the anti-profile Support Vector Machine as a novel algorithm to address the 
anomaly classification problem. We have shown that under the assumption that the classes we 
are trying to distinguish with a classifier are heterogeneous with respect to a third stable class, 
we can define a Support Vector Machine based on an indirect kernel using the stable class. We 
have shown that the dual of the apSVM optimization problem is equivalent to that of the standard 
SVM with the addition of an indirect kernel that measures similarity of anomalous samples through 
similarity to the stable normal class. Furthermore, we have characterized this indirect kernel as the 
inner product in a Reproducing Kernel Hilbert Space between representers that are projected to the 
subspace spanned by the representers of the normal samples. This led to the result that the apSVM 
will learn classifiers that are more robust and stable than a standard SVM in this learning setting. 
We have shown by simulation and application to cancer genomics datasets that the anti-profile SVM 
does in fact produce classifiers that are more accurate and stable than the standard SVM in this 
setting. 

Figure 4: Classification results for gene expression in cancer [2]. Similar to Figure 3, the
accuracy of the standard and anti-profile SVMs is similar (in this case almost perfect test-set
accuracy is achieved by both classifiers). However, the anti-profile SVM again uses fewer support
vectors, leading to classifiers that are more robust and stable.



While the motivation and examples provided here are based on cancer genomics, we expect that the
anomaly classification setting is applicable to other areas. In particular, we have started looking
at the area of statistical debugging as a suitable application [16].

The characterization of the indirect kernel through projection to the normal subspace also suggests
other possible classifiers suitable to this task; for instance, one could define a margin based on
the projection distance directly. Furthermore, connections to kernel methods for quantile
estimation [11] will be interesting to explore.

Another interesting direction of research would be to further solidify the stability
characterization we provide in Section 3.4, for instance by exploring the relationship to other
leave-one-out bounds IS |6lil5,,5J, and the span rule for kernel quantile estimation ifTOll.